2 research outputs found

Unified End-to-End Speech Recognition and Endpointing for Fast and Efficient Speech Systems

Authors: Bijwadia Shaan
Chang Shuo-yiin
He Yanzhang
Li Bo
Sainath Tara
Zhang Chao
Publication date: 01/11/2022

Automatic speech recognition (ASR) systems typically rely on an external endpointer (EP) model to identify speech boundaries. In this work, we propose a method to jointly train the ASR and EP tasks in a single end-to-end (E2E) multitask model, improving EP quality by optionally leveraging information from the ASR audio encoder. We introduce a "switch" connection, which trains the EP to consume either the audio frames directly or low-level latent representations from the ASR model. This results in a single E2E model that can be used during inference to perform frame filtering at low cost, and also make high-quality end-of-query (EOQ) predictions based on ongoing ASR computation. We present results on a voice search test set showing that, compared to separate single-task models, this approach reduces median endpoint latency by 120 ms (30.8% reduction), and 90th percentile latency by 170 ms (23.0% reduction), without regressing word error rate. For continuous recognition, WER improves by 10.6% (relative).

Comment: To be published in Spoken Language Technology Workshop (SLT) 202

arXiv.org e-Print Archive

Text Injection for Capitalization and Turn-Taking Prediction in Speech Models

Authors: Bijwadia Shaan
Chang Shuo-yiin
Meng Zhong
Sainath Tara N.
Wang Weiran
Zhang Hao
Publication date: 14/08/2023

Text injection for automatic speech recognition (ASR), wherein unpaired text-only data is used to supplement paired audio-text data, has shown promising improvements for word error rate. This study examines the use of text injection for auxiliary tasks, which are the non-ASR tasks often performed by an E2E model. In this work, we use joint end-to-end and internal language model training (JEIT) as our text injection algorithm to train an ASR model which performs two auxiliary tasks. The first is capitalization, which is a de-normalization task. The second is turn-taking prediction, which attempts to identify whether a user has completed their conversation turn in a digital assistant interaction. We show results demonstrating that our text injection method boosts capitalization performance for long-tail data, and improves turn-taking detection recall.

arXiv.org e-Print Archive